HELOC DATA LOAN Risk Flag Prediction

by group4

HelocData is the anonymized data of Home Equity Line of Credit (HELOC) loans. This project is to predict the risk flag ('RiskFlag') which indicates whether the loan is ever 90-day delinquent over a two-year period.

Data Statistics

Before reading the csv file, it is important to note that

-7: Record or No Investigation -8: Usable/Valid trades or inquiries -9: Condition Not Met (e.g., No Delinquencies, No enquiries)

all represent the missing information. (See HelocDataDict, SpecialValues) For each feature, all these missing values will be treated as NaN.

In [0]:
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/choijaewon959/HELOC-RiskFlag-Prediction/master/HelocData.csv'

df = pd.read_csv(url, na_values = [-7, -8, -9])
df.head()
Out[0]:
RiskFlag x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23
0 Bad 75.0 169.0 2.0 59.0 21.0 0.0 0.0 100.0 NaN 7.0 8.0 22.0 4.0 36.0 NaN 4.0 4.0 43.0 112.0 4.0 6.0 0.0 83.0
1 Bad 66.0 502.0 4.0 145.0 34.0 0.0 0.0 97.0 36.0 6.0 6.0 37.0 4.0 27.0 4.0 3.0 3.0 80.0 53.0 17.0 3.0 12.0 83.0
2 Good 69.0 338.0 2.0 62.0 22.0 0.0 0.0 96.0 12.0 6.0 6.0 23.0 3.0 35.0 0.0 4.0 4.0 25.0 100.0 3.0 2.0 1.0 45.0
3 Good 75.0 422.0 1.0 91.0 55.0 0.0 0.0 100.0 NaN 7.0 8.0 57.0 4.0 33.0 0.0 4.0 4.0 2.0 11.0 12.0 2.0 1.0 57.0
4 Bad 63.0 242.0 2.0 68.0 25.0 0.0 0.0 100.0 NaN 7.0 8.0 26.0 1.0 19.0 NaN 3.0 3.0 73.0 NaN 12.0 1.0 5.0 87.0

Change the categorical label to 0 and 1.

In [0]:
df['RiskFlag'] = df['RiskFlag'].replace(['Good', 'Bad'], [1, 0])
df.head()
Out[0]:
RiskFlag x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23
0 0 75.0 169.0 2.0 59.0 21.0 0.0 0.0 100.0 NaN 7.0 8.0 22.0 4.0 36.0 NaN 4.0 4.0 43.0 112.0 4.0 6.0 0.0 83.0
1 0 66.0 502.0 4.0 145.0 34.0 0.0 0.0 97.0 36.0 6.0 6.0 37.0 4.0 27.0 4.0 3.0 3.0 80.0 53.0 17.0 3.0 12.0 83.0
2 1 69.0 338.0 2.0 62.0 22.0 0.0 0.0 96.0 12.0 6.0 6.0 23.0 3.0 35.0 0.0 4.0 4.0 25.0 100.0 3.0 2.0 1.0 45.0
3 1 75.0 422.0 1.0 91.0 55.0 0.0 0.0 100.0 NaN 7.0 8.0 57.0 4.0 33.0 0.0 4.0 4.0 2.0 11.0 12.0 2.0 1.0 57.0
4 0 63.0 242.0 2.0 68.0 25.0 0.0 0.0 100.0 NaN 7.0 8.0 26.0 1.0 19.0 NaN 3.0 3.0 73.0 NaN 12.0 1.0 5.0 87.0

Now, let's read the explanation data dictionary.

In [0]:
from google.colab import files
uploaded = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving HelocDataDict.xlsx to HelocDataDict (5).xlsx
In [0]:
import io
data_dict = pd.read_excel(io.BytesIO(uploaded['HelocDataDict.xlsx']))

# Heloc_data_dict_excel_path = 'https://github.com/choijaewon959/HELOC-RiskFlag-Prediction/blob/master/HelocDataDict.xlsx'
# data_dict = pd.read_excel(Heloc_data_dict_excel_path, 'Data Dictionary', header=None).iloc[:,:2]

#Heloc_data_dict_xls = pd.ExcelFile('HelocDataDict.xlsx')
#data_dict = pd.read_excel(Heloc_data_dict_xls, 'Data Dictionary').iloc[:,:2]

data_dict
Out[0]:
Variable Names Description
0 RiskFlag Paid as negotiated flag (12-36 Months). String...
1 x1 Consolidated version of risk markers
2 x2 Months Since Oldest Trade Open
3 x3 Months Since Most Recent Trade Open
4 x4 Average Months in File
5 x5 Number Satisfactory Trades
6 x6 Number Trades 60+ Ever
7 x7 Number Trades 90+ Ever
8 x8 Percent Trades Never Delinquent
9 x9 Months Since Most Recent Delinquency
10 x10 Max Delq/Public Records Last 12 Months. See ta...
11 x11 Max Delinquency Ever. See tab "MaxDelq" for ea...
12 x12 Number of Total Trades (total number of credit...
13 x13 Number of Trades Open in Last 12 Months
14 x14 Percent Installment Trades
15 x15 Months Since Most Recent Inq excl 7days
16 x16 Number of Inq Last 6 Months
17 x17 Number of Inq Last 6 Months excl 7days. Exclud...
18 x18 Net Fraction Revolving Burden. This is revolvi...
19 x19 Net Fraction Installment Burden. This is insta...
20 x20 Number Revolving Trades with Balance
21 x21 Number Installment Trades with Balance
22 x22 Number Bank/Natl Trades w high utilization ratio
23 x23 Percent Trades with Balance

Now, let's take a look at the missing value frequencies before the actual cleaning of the data.

In [0]:
# count the missing value frequency
df_miss_num = df.iloc[:, 1:].isnull().sum()
df_total_num = df.shape[0]
df_miss_freq = df_miss_num / df_total_num                       

print(df_miss_freq.apply(lambda x: format(x, '.2%')).to_string())
x1      5.72%
x2      7.91%
x3      5.62%
x4      5.62%
x5      5.62%
x6      5.62%
x7      5.62%
x8      5.62%
x9     51.90%
x10     5.62%
x11     5.62%
x12     5.62%
x13     5.62%
x14     5.62%
x15    27.91%
x16     5.62%
x17     5.62%
x18     7.40%
x19    38.31%
x20     7.11%
x21    13.85%
x22    11.20%
x23     5.79%
In [0]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

plt.figure(figsize=(12, 6))
df_miss_freq.plot.bar(width=0.75, rot=0, ax=plt.gca())
plt.ylabel('Frequencies')
plt.title('Bar Plot of Missing Value Frequencies', fontsize=15)
plt.show()

It is important to note that over half of the values of x9 are missing and therefore imputation of the values based only on existing values may contain biases (For the same reason, x15 and x19 may require other imputation methods or should be dropped).

Deeper Data Interpretation and Preprocessing

Renaming the features

For the readability of the data, let's use the variable name in the dictionary by merging the description of the data dictionary with the raw data.

In [0]:
var_names = data_dict['Description'][1:].str.split('(\.| \()', expand=True).iloc[:,0]
var_names
Out[0]:
1                 Consolidated version of risk markers
2                       Months Since Oldest Trade Open
3                  Months Since Most Recent Trade Open
4                               Average Months in File
5                           Number Satisfactory Trades
6                               Number Trades 60+ Ever
7                               Number Trades 90+ Ever
8                      Percent Trades Never Delinquent
9                 Months Since Most Recent Delinquency
10              Max Delq/Public Records Last 12 Months
11                                Max Delinquency Ever
12                              Number of Total Trades
13             Number of Trades Open in Last 12 Months
14                          Percent Installment Trades
15             Months Since Most Recent Inq excl 7days
16                         Number of Inq Last 6 Months
17              Number of Inq Last 6 Months excl 7days
18                       Net Fraction Revolving Burden
19                     Net Fraction Installment Burden
20                Number Revolving Trades with Balance
21              Number Installment Trades with Balance
22    Number Bank/Natl Trades w high utilization ratio
23                         Percent Trades with Balance
Name: 0, dtype: object
In [0]:
df.columns = np.concatenate((['RiskFlag'], var_names), axis=0)
df.head()
Out[0]:
RiskFlag Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Months Since Most Recent Delinquency Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Months Since Most Recent Inq excl 7days Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Net Fraction Installment Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
0 0 75.0 169.0 2.0 59.0 21.0 0.0 0.0 100.0 NaN 7.0 8.0 22.0 4.0 36.0 NaN 4.0 4.0 43.0 112.0 4.0 6.0 0.0 83.0
1 0 66.0 502.0 4.0 145.0 34.0 0.0 0.0 97.0 36.0 6.0 6.0 37.0 4.0 27.0 4.0 3.0 3.0 80.0 53.0 17.0 3.0 12.0 83.0
2 1 69.0 338.0 2.0 62.0 22.0 0.0 0.0 96.0 12.0 6.0 6.0 23.0 3.0 35.0 0.0 4.0 4.0 25.0 100.0 3.0 2.0 1.0 45.0
3 1 75.0 422.0 1.0 91.0 55.0 0.0 0.0 100.0 NaN 7.0 8.0 57.0 4.0 33.0 0.0 4.0 4.0 2.0 11.0 12.0 2.0 1.0 57.0
4 0 63.0 242.0 2.0 68.0 25.0 0.0 0.0 100.0 NaN 7.0 8.0 26.0 1.0 19.0 NaN 3.0 3.0 73.0 NaN 12.0 1.0 5.0 87.0
In [0]:
df_X = df.iloc[:,1:]
df_y = df.iloc[:,0]

df_X.head()
Out[0]:
Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Months Since Most Recent Delinquency Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Months Since Most Recent Inq excl 7days Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Net Fraction Installment Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
0 75.0 169.0 2.0 59.0 21.0 0.0 0.0 100.0 NaN 7.0 8.0 22.0 4.0 36.0 NaN 4.0 4.0 43.0 112.0 4.0 6.0 0.0 83.0
1 66.0 502.0 4.0 145.0 34.0 0.0 0.0 97.0 36.0 6.0 6.0 37.0 4.0 27.0 4.0 3.0 3.0 80.0 53.0 17.0 3.0 12.0 83.0
2 69.0 338.0 2.0 62.0 22.0 0.0 0.0 96.0 12.0 6.0 6.0 23.0 3.0 35.0 0.0 4.0 4.0 25.0 100.0 3.0 2.0 1.0 45.0
3 75.0 422.0 1.0 91.0 55.0 0.0 0.0 100.0 NaN 7.0 8.0 57.0 4.0 33.0 0.0 4.0 4.0 2.0 11.0 12.0 2.0 1.0 57.0
4 63.0 242.0 2.0 68.0 25.0 0.0 0.0 100.0 NaN 7.0 8.0 26.0 1.0 19.0 NaN 3.0 3.0 73.0 NaN 12.0 1.0 5.0 87.0

Below is the statistic of the data.

In [0]:
data_stat = df.describe().T
data_stat
Out[0]:
count mean std min 25% 50% 75% max
RiskFlag 10459.0 0.478057 0.499542 0.0 0.0 0.0 1.0 1.0
Consolidated version of risk markers 9861.0 72.060440 9.871795 33.0 64.0 72.0 80.0 94.0
Months Since Oldest Trade Open 9632.0 200.769103 97.946081 2.0 135.0 186.0 257.0 803.0
Months Since Most Recent Trade Open 9871.0 9.588492 12.963398 0.0 3.0 6.0 12.0 383.0
Average Months in File 9871.0 78.778138 34.066063 4.0 57.0 76.0 97.0 383.0
Number Satisfactory Trades 9871.0 21.121467 11.321396 0.0 13.0 20.0 28.0 79.0
Number Trades 60+ Ever 9871.0 0.581400 1.238783 0.0 0.0 0.0 1.0 19.0
Number Trades 90+ Ever 9871.0 0.384763 0.993223 0.0 0.0 0.0 0.0 19.0
Percent Trades Never Delinquent 9871.0 92.359943 11.772876 0.0 89.0 97.0 100.0 100.0
Months Since Most Recent Delinquency 5031.0 21.879547 20.808514 0.0 5.0 15.0 34.0 83.0
Max Delq/Public Records Last 12 Months 9871.0 5.757978 1.644518 0.0 5.0 6.0 7.0 9.0
Max Delinquency Ever 9871.0 6.374531 1.849186 2.0 6.0 6.0 8.0 8.0
Number of Total Trades 9871.0 22.635498 12.999924 0.0 13.0 21.0 30.0 104.0
Number of Trades Open in Last 12 Months 9871.0 1.863844 1.828099 0.0 0.0 1.0 3.0 19.0
Percent Installment Trades 9871.0 34.618681 17.953432 0.0 21.0 33.0 45.0 100.0
Months Since Most Recent Inq excl 7days 7540.0 2.477719 4.760413 0.0 0.0 0.0 3.0 24.0
Number of Inq Last 6 Months 9871.0 1.455982 2.136161 0.0 0.0 1.0 2.0 66.0
Number of Inq Last 6 Months excl 7days 9871.0 1.397123 2.096102 0.0 0.0 1.0 2.0 66.0
Net Fraction Revolving Burden 9685.0 34.857718 28.896627 0.0 9.0 29.0 56.0 232.0
Net Fraction Installment Burden 6452.0 68.537973 24.903776 0.0 53.0 74.0 87.0 471.0
Number Revolving Trades with Balance 9715.0 4.102110 3.021621 0.0 2.0 3.0 5.0 32.0
Number Installment Trades with Balance 9010.0 2.484906 1.634503 1.0 1.0 2.0 3.0 23.0
Number Bank/Natl Trades w high utilization ratio 9288.0 1.092270 1.536250 0.0 0.0 1.0 2.0 18.0
Percent Trades with Balance 9853.0 66.449000 22.035459 0.0 50.0 67.0 83.0 100.0

Data Imputation

It is important to preprocess the data so that feature extraction or feature engineering process does not result in biased or misleading results. This part includes data cleaning and data imputation with different criteria based on the nature of each feature.

As discussed earlier, x9 (Months Since Most Recent Delinquency) , x15 (Months Since Most Recent Inq excl 7days) and x19 (Net Fraction Installment Burden) will be dropped for having too many missing values.

In [0]:
df = df.drop(columns=['Months Since Most Recent Delinquency', 'Months Since Most Recent Inq excl 7days', 'Net Fraction Installment Burden'])
df.head()
Out[0]:
RiskFlag Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
0 0 75.0 169.0 2.0 59.0 21.0 0.0 0.0 100.0 7.0 8.0 22.0 4.0 36.0 4.0 4.0 43.0 4.0 6.0 0.0 83.0
1 0 66.0 502.0 4.0 145.0 34.0 0.0 0.0 97.0 6.0 6.0 37.0 4.0 27.0 3.0 3.0 80.0 17.0 3.0 12.0 83.0
2 1 69.0 338.0 2.0 62.0 22.0 0.0 0.0 96.0 6.0 6.0 23.0 3.0 35.0 4.0 4.0 25.0 3.0 2.0 1.0 45.0
3 1 75.0 422.0 1.0 91.0 55.0 0.0 0.0 100.0 7.0 8.0 57.0 4.0 33.0 4.0 4.0 2.0 12.0 2.0 1.0 57.0
4 0 63.0 242.0 2.0 68.0 25.0 0.0 0.0 100.0 7.0 8.0 26.0 1.0 19.0 3.0 3.0 73.0 12.0 1.0 5.0 87.0

Now, let's take a look at the distribution of the features so that more reasonable impuation methods can be selected.

In [0]:
# histogram for three features
cols = 3
rows = (df.shape[1]-1)//cols+1

fig = plt.figure(figsize=(50, 5*rows))
for i, feature in enumerate(df.columns[1:]):
    ax = fig.add_subplot(rows,cols,i+1)
    
    df[feature].hist(bins=15, density=True, color='green', alpha=0.4, rwidth=0.95, figsize=(15,50))
    df[feature].plot(kind='density', color='green')

    # draw mean and median line
    x1 = df[feature].mean(skipna=True)
    x2 = df[feature].median(skipna=True)
    plt.vlines(x=x1,ymin=0,ymax=0.025,color='r',linestyles=':',label = 'mean')
    plt.vlines(x2,0,0.025,'b','-.',label = 'median')
    
    #range of x
    x_min = data_stat.loc[feature]['min']
    x_max = data_stat.loc[feature]['max']
    
    ax.set_title(feature, fontsize = 10)
    
    plt.legend()
    plt.xlim(x_min,x_max)

plt.show()

Skewed Data

There are some variables which are relatively more skewed than the others. Imputation of these features with their means may lead to biased results by filling the missing values with larger / smaller values than desired. Therefore, these values will be imputed with their medians.

These features include:

Right skewed

Months Since Oldest Trade Open
Number Satisfactory Trades
Number of Trades Open in Last 12 Months
Net Fraction Revolving Burden
Number Revolving Trades with Balance

Left skewed

Percent Trades Never Delinquent

In [0]:
from sklearn.model_selection import train_test_split

random_seed = 4

df_train, df_test = train_test_split(df, test_size = 0.2, random_state = np.random.seed(random_seed))

df_train.shape
Out[0]:
(8367, 21)
In [0]:
df_test.shape
Out[0]:
(2092, 21)
In [0]:
from sklearn.impute import SimpleImputer

skewed_cols = ['Months Since Oldest Trade Open', 'Number Satisfactory Trades', 'Number of Trades Open in Last 12 Months', 
               'Net Fraction Revolving Burden', 'Number Revolving Trades with Balance']

#train
imp = SimpleImputer(missing_values=np.nan, strategy="median")
df_train[skewed_cols] = imp.fit_transform(df_train[skewed_cols])

#test
imp_1 = SimpleImputer(missing_values=np.nan, strategy="median")
df_test[skewed_cols] = imp_1.fit_transform(df_test[skewed_cols])

df_train.head()
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py:494: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py:494: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
Out[0]:
RiskFlag Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
4178 0 65.0 101.0 2.0 34.0 15.0 0.0 0.0 100.0 7.0 8.0 18.0 5.0 33.0 9.0 9.0 72.0 5.0 3.0 4.0 69.0
9231 0 61.0 149.0 13.0 75.0 4.0 4.0 4.0 100.0 7.0 8.0 8.0 0.0 50.0 0.0 0.0 30.0 3.0 2.0 NaN 100.0
9444 0 56.0 169.0 1.0 56.0 7.0 2.0 2.0 80.0 6.0 3.0 26.0 4.0 20.0 3.0 3.0 62.0 4.0 NaN 1.0 83.0
354 1 87.0 99.0 37.0 99.0 24.0 0.0 0.0 100.0 7.0 8.0 3.0 0.0 29.0 0.0 0.0 11.0 1.0 2.0 0.0 38.0
5494 1 88.0 146.0 16.0 56.0 10.0 0.0 0.0 100.0 7.0 8.0 13.0 0.0 10.0 0.0 0.0 16.0 1.0 NaN 0.0 17.0

Ordinal Data

By looking at the histogram of x10 Max Delq/Public Records Last 12 Months and x11 Max Delinquency Ever, it seems more reasonable to follow the mode of the values because the mode of the data more represents the behavioral trends of the loan for this ordinal features.

In [0]:
ordinal_cols = ['Max Delq/Public Records Last 12 Months', 'Max Delinquency Ever']

imp2 = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
df_train[ordinal_cols] = imp2.fit_transform(df_train[ordinal_cols])

imp2_1 = SimpleImputer(missing_values = np.nan, strategy="most_frequent")
df_test[ordinal_cols] = imp2_1.fit_transform(df_test[ordinal_cols])

df_train.head()
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py:494: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py:494: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
Out[0]:
RiskFlag Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
4178 0 65.0 101.0 2.0 34.0 15.0 0.0 0.0 100.0 7.0 8.0 18.0 5.0 33.0 9.0 9.0 72.0 5.0 3.0 4.0 69.0
9231 0 61.0 149.0 13.0 75.0 4.0 4.0 4.0 100.0 7.0 8.0 8.0 0.0 50.0 0.0 0.0 30.0 3.0 2.0 NaN 100.0
9444 0 56.0 169.0 1.0 56.0 7.0 2.0 2.0 80.0 6.0 3.0 26.0 4.0 20.0 3.0 3.0 62.0 4.0 NaN 1.0 83.0
354 1 87.0 99.0 37.0 99.0 24.0 0.0 0.0 100.0 7.0 8.0 3.0 0.0 29.0 0.0 0.0 11.0 1.0 2.0 0.0 38.0
5494 1 88.0 146.0 16.0 56.0 10.0 0.0 0.0 100.0 7.0 8.0 13.0 0.0 10.0 0.0 0.0 16.0 1.0 NaN 0.0 17.0

For the other features, Nan values will be imputed with the mean.

In [0]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

columns = df_train.columns

#note that all the special columns above are already imputed. Therefore not affected in this imputation
sim_imputer = SimpleImputer(missing_values = np.nan, strategy="mean")
df_train = pd.DataFrame(sim_imputer.fit_transform(df_train), columns = columns)

sim_imputer_1 = SimpleImputer(missing_values = np.nan, strategy="mean")
df_test = pd.DataFrame(sim_imputer_1.fit_transform(df_test), columns = columns)

df_train.head()
Out[0]:
RiskFlag Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
0 0.0 65.0 101.0 2.0 34.0 15.0 0.0 0.0 100.0 7.0 8.0 18.0 5.0 33.0 9.0 9.0 72.0 5.0 3.000000 4.000000 69.0
1 0.0 61.0 149.0 13.0 75.0 4.0 4.0 4.0 100.0 7.0 8.0 8.0 0.0 50.0 0.0 0.0 30.0 3.0 2.000000 1.082584 100.0
2 0.0 56.0 169.0 1.0 56.0 7.0 2.0 2.0 80.0 6.0 3.0 26.0 4.0 20.0 3.0 3.0 62.0 4.0 2.475789 1.000000 83.0
3 1.0 87.0 99.0 37.0 99.0 24.0 0.0 0.0 100.0 7.0 8.0 3.0 0.0 29.0 0.0 0.0 11.0 1.0 2.000000 0.000000 38.0
4 1.0 88.0 146.0 16.0 56.0 10.0 0.0 0.0 100.0 7.0 8.0 13.0 0.0 10.0 0.0 0.0 16.0 1.0 2.475789 0.000000 17.0

Data Exploration

Data cleaning and imputation process has been completed. Now, for a better performance of the model and iterpretability of the data, it is essential to understand more about the data.

In [0]:
X_train, y_train = df_train.iloc[:,1:], df_train['RiskFlag']
X_train.head()
Out[0]:
Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
0 65.0 101.0 2.0 34.0 15.0 0.0 0.0 100.0 7.0 8.0 18.0 5.0 33.0 9.0 9.0 72.0 5.0 3.000000 4.000000 69.0
1 61.0 149.0 13.0 75.0 4.0 4.0 4.0 100.0 7.0 8.0 8.0 0.0 50.0 0.0 0.0 30.0 3.0 2.000000 1.082584 100.0
2 56.0 169.0 1.0 56.0 7.0 2.0 2.0 80.0 6.0 3.0 26.0 4.0 20.0 3.0 3.0 62.0 4.0 2.475789 1.000000 83.0
3 87.0 99.0 37.0 99.0 24.0 0.0 0.0 100.0 7.0 8.0 3.0 0.0 29.0 0.0 0.0 11.0 1.0 2.000000 0.000000 38.0
4 88.0 146.0 16.0 56.0 10.0 0.0 0.0 100.0 7.0 8.0 13.0 0.0 10.0 0.0 0.0 16.0 1.0 2.475789 0.000000 17.0
In [0]:
y_train.head()
Out[0]:
0    0.0
1    0.0
2    0.0
3    1.0
4    1.0
Name: RiskFlag, dtype: float64
In [0]:
X_test, y_test = df_test.iloc[:,1:], df_test['RiskFlag']
X_test.head()
Out[0]:
Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
0 64.000000 150.0 2.000000 50.000000 32.0 1.000000 1.00000 94.000000 0.0 2.0 34.000000 3.0 32.000000 2.000000 2.000000 55.0 7.0 3.000000 3.00000 79.000000
1 70.000000 139.0 5.000000 67.000000 19.0 1.000000 0.00000 83.000000 6.0 5.0 23.000000 2.0 57.000000 0.000000 0.000000 39.0 2.0 3.000000 0.00000 100.000000
2 72.282221 188.0 9.834774 79.676157 20.0 0.566853 0.37214 92.517031 7.0 8.0 22.677682 1.0 34.732588 1.493645 1.436706 29.0 3.0 2.521886 1.13145 66.459969
3 76.000000 211.0 1.000000 84.000000 24.0 0.000000 0.00000 100.000000 7.0 8.0 62.000000 7.0 34.000000 6.000000 6.000000 16.0 5.0 4.000000 1.00000 69.000000
4 57.000000 168.0 73.000000 107.000000 8.0 2.000000 2.00000 89.000000 0.0 6.0 9.000000 0.0 44.000000 0.000000 0.000000 29.0 3.0 2.521886 1.13145 66.459969

Boxplot

Boxplot will provide the general distribution of each features and outliers.

In [0]:
cols = 4
rows = (df.shape[1]-1)//cols + 1

features = df_train.columns[1:]

fig = plt.figure(figsize=(15, 5*rows))
for i, var_name in enumerate(features):
    ax = fig.add_subplot(rows,cols,i+1)
    bp = df_train.boxplot(column=var_name, by='RiskFlag',
                              ax=ax, return_type='dict')
    [value.set_color('r') for value in bp[0]['medians']] # set median line color
    [value.set_linewidth(2) for value in bp[0]['medians']] # set median linewidth

    ax.set_xlabel('')
    ax.set_title(var_name, fontsize=10)
    
plt.suptitle('Box Plot Grouped by RiskFlag', fontsize=15)
plt.tight_layout(rect=[0, 0, 1, 0.97]) # remove extra space
plt.show()

Consolidated version of risk markers obviously is the most important factor deciding the label since the data of two labels show completely different distributions.

Months Since Most Recent Trade Open does not seem to be an important factor deciding good or bad risk since it shows rounghly similar distribution of variables to both labels and has relatively many outliers.

Multicolinearity

To understand the data and select the meaningful and inpendent features, it is important to study the correlation of the data deeper. Below is the correlation heatmap of the data.

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15,15))
corr = df_train.corr()
sns.heatmap(corr,xticklabels=corr.columns, yticklabels=corr.columns, annot = True,linewidths=.5, ax=ax)
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9061e16940>

We can observe some meaningful findings from the heatmap.

(1) Consolidated version of risk markers is the most highly correlated features among all.

(2) Consolidated version of risk markers is positively correlated with the RiskFlag (response/dependent variable). This implies that the higher the value of this feature is, the less risky the loan is.

(3) Other than Consolidated version of risk markers, other features do not seem to be highly correlated with the response.

(4) Meanwhile, Consolidated version of risk markers is relatively highly correleated with the other features such as "Percent Trades Never Delinquent", "Max Delq/Public Records Last 12 Months", "Max Delinquency Ever", "Net Fraction Revolving Burden", "Number Bank/Natl Trades with high utilization ratio", and "Percent Trades with Balance". It can be inferred that the Consolidated version of risks markers is calculated by the values that are correlated with these features.

(5) "Percent Trades Never Delinquent", "Max Delq/Public Records Last 12 Months", "Max Delinquency Ever" all seem to be relatively highly correlated to each other.

(6) "Number of Inq Last 6 Months" and "Number of Inq Last 6 Months excl 7 days" are highly correlated and this intuitively makes sense.

Consolidated version of risk markers

Based on the observations above, consolidated version of risk markers is obviously the credit score of the data instance (Considering that positively correlated with percentage of the trades never delinquent, negatively correlated with utilization ratio and revolving burden).

It is important to note that this feature is relatively highly related to the riskFlag (response) even though the features by which this feature is calculated are much less correlated with the response. This shows that devising a new features based on the existing features may help boosting up the accuracy of the model by providing more correlated feature.

Considering that Consolidated version of risk markers is the credit score and that the data is provided by the FICO, the feature can be assumed to be the FICO score of the loan calculated by the scorecard below.

source: FICO
https://www.fico.com/en/latest-thinking/fact-sheet/understanding-fico-scores

Feature Engineering

Engineering a new feature

By assuming that the feature is calculated based on the scorecard above and correlations between the data, it can be inferred that Consolidated version of risk markers is calculated based on some values related to Ever Delinquent before (negatively affecting the score: the longer the period of delinquency, the worse the loan is), Utilization ratio, Number of months in file, Number of inquiries before, Number of trades.

Then, it can be said that the features above are already taken into account and therefore, engineering a new feature based on those variables are redundant, increase the multicolinearity between the features, and therefore decrease the interpretability of the model and the data.

Hence, it would be more meaningful to come up with a new feature variable calculated based on the features which are less correlated to the features above. Therefore, it is reasonable to select the features that are intuitively important in explaining the label, but that are not included in calculation of consolidated version of risk markets.

Feature satisfying these criteria is Number Satisfactory Trades. It can be assuemd that higher number of satisfactory trades will positively affect the risk flag (good risk flag). However, consider one data instance (A) has over thousands total trades, but only has 20 satisfactory trades. Assume other data instance has fewer number of satisfactory trades, but with much smaller number of total trades. If only the number of satisfactory trades is considered as the factor affecting the risk, A will be considered to be a data instance who has good risk flag. Here, the result is highly likely to be biased since the percentage of satisfactory trades of A can be significantly lower than the others. Therefore, it seems more reasonable to consider the percentage of satisfactory trades rather than total number of satisfactory trades.

In [0]:
X_train['Satisfactory Trades Percentage'] = X_train['Number Satisfactory Trades'] / X_train['Number of Total Trades']
X_test['Satisfactory Trades Percentage'] = X_test['Number Satisfactory Trades'] / X_test['Number of Total Trades']

#update features
features = X_train.columns
X_train['Satisfactory Trades Percentage']
Out[0]:
0       0.833333
1       0.500000
2       0.269231
3       8.000000
4       0.769231
          ...   
8362         inf
8363    0.883978
8364    0.769231
8365    1.000000
8366    0.933333
Name: Satisfactory Trades Percentage, Length: 8367, dtype: float64

There are some values bigger than 1 as well as infinity, which is misleading. For these values, data will be imputed as mean.

In [0]:
X_train['Satisfactory Trades Percentage'][X_train['Satisfactory Trades Percentage'] > 1.0 ] = np.nan
X_test['Satisfactory Trades Percentage'][X_test['Satisfactory Trades Percentage'] > 1.0 ] = np.nan


X_train
Out[0]:
Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance Satisfactory Trades Percentage
0 65.000000 101.0 2.000000 34.000000 15.0 0.00000 0.000000 100.00000 7.0 8.0 18.000 5.0 33.000000 9.000000 9.000000 72.0 5.0 3.000000 4.000000 69.000000 0.833333
1 61.000000 149.0 13.000000 75.000000 4.0 4.00000 4.000000 100.00000 7.0 8.0 8.000 0.0 50.000000 0.000000 0.000000 30.0 3.0 2.000000 1.082584 100.000000 0.500000
2 56.000000 169.0 1.000000 56.000000 7.0 2.00000 2.000000 80.00000 6.0 3.0 26.000 4.0 20.000000 3.000000 3.000000 62.0 4.0 2.475789 1.000000 83.000000 0.269231
3 87.000000 99.0 37.000000 99.000000 24.0 0.00000 0.000000 100.00000 7.0 8.0 3.000 0.0 29.000000 0.000000 0.000000 11.0 1.0 2.000000 0.000000 38.000000 NaN
4 88.000000 146.0 16.000000 56.000000 10.0 0.00000 0.000000 100.00000 7.0 8.0 13.000 0.0 10.000000 0.000000 0.000000 16.0 1.0 2.475789 0.000000 17.000000 0.769231
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8362 69.000000 185.0 5.000000 138.000000 30.0 0.00000 0.000000 97.00000 6.0 6.0 0.000 1.0 20.000000 4.000000 3.000000 40.0 12.0 2.000000 3.000000 74.000000 NaN
8363 72.005318 185.0 9.527201 78.554656 20.0 0.58502 0.387905 92.32085 7.0 8.0 22.625 1.0 34.590334 1.446609 1.387272 30.0 3.0 2.475789 1.082584 66.446275 0.883978
8364 73.000000 149.0 2.000000 70.000000 10.0 1.00000 1.000000 85.00000 4.0 4.0 13.000 3.0 31.000000 0.000000 0.000000 47.0 4.0 2.000000 1.000000 86.000000 0.769231
8365 65.000000 241.0 4.000000 80.000000 25.0 0.00000 0.000000 96.00000 4.0 6.0 25.000 1.0 32.000000 2.000000 2.000000 37.0 3.0 3.000000 1.000000 60.000000 1.000000
8366 77.000000 148.0 5.000000 84.000000 14.0 0.00000 0.000000 100.00000 7.0 8.0 15.000 1.0 53.000000 1.000000 1.000000 57.0 1.0 3.000000 1.000000 83.000000 0.933333

8367 rows × 21 columns

In [0]:
new_feature_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
new_feature_imputer2 = SimpleImputer(missing_values=np.nan, strategy="mean")

X_train['Satisfactory Trades Percentage'] = new_feature_imputer.fit_transform(X_train['Satisfactory Trades Percentage'][:, np.newaxis])
X_test['Satisfactory Trades Percentage'] = new_feature_imputer2.fit_transform(X_test['Satisfactory Trades Percentage'][:, np.newaxis])

X_train
Out[0]:
Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance Satisfactory Trades Percentage
0 65.000000 101.0 2.000000 34.000000 15.0 0.00000 0.000000 100.00000 7.0 8.0 18.000 5.0 33.000000 9.000000 9.000000 72.0 5.0 3.000000 4.000000 69.000000 0.833333
1 61.000000 149.0 13.000000 75.000000 4.0 4.00000 4.000000 100.00000 7.0 8.0 8.000 0.0 50.000000 0.000000 0.000000 30.0 3.0 2.000000 1.082584 100.000000 0.500000
2 56.000000 169.0 1.000000 56.000000 7.0 2.00000 2.000000 80.00000 6.0 3.0 26.000 4.0 20.000000 3.000000 3.000000 62.0 4.0 2.475789 1.000000 83.000000 0.269231
3 87.000000 99.0 37.000000 99.000000 24.0 0.00000 0.000000 100.00000 7.0 8.0 3.000 0.0 29.000000 0.000000 0.000000 11.0 1.0 2.000000 0.000000 38.000000 0.890655
4 88.000000 146.0 16.000000 56.000000 10.0 0.00000 0.000000 100.00000 7.0 8.0 13.000 0.0 10.000000 0.000000 0.000000 16.0 1.0 2.475789 0.000000 17.000000 0.769231
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8362 69.000000 185.0 5.000000 138.000000 30.0 0.00000 0.000000 97.00000 6.0 6.0 0.000 1.0 20.000000 4.000000 3.000000 40.0 12.0 2.000000 3.000000 74.000000 0.890655
8363 72.005318 185.0 9.527201 78.554656 20.0 0.58502 0.387905 92.32085 7.0 8.0 22.625 1.0 34.590334 1.446609 1.387272 30.0 3.0 2.475789 1.082584 66.446275 0.883978
8364 73.000000 149.0 2.000000 70.000000 10.0 1.00000 1.000000 85.00000 4.0 4.0 13.000 3.0 31.000000 0.000000 0.000000 47.0 4.0 2.000000 1.000000 86.000000 0.769231
8365 65.000000 241.0 4.000000 80.000000 25.0 0.00000 0.000000 96.00000 4.0 6.0 25.000 1.0 32.000000 2.000000 2.000000 37.0 3.0 3.000000 1.000000 60.000000 1.000000
8366 77.000000 148.0 5.000000 84.000000 14.0 0.00000 0.000000 100.00000 7.0 8.0 15.000 1.0 53.000000 1.000000 1.000000 57.0 1.0 3.000000 1.000000 83.000000 0.933333

8367 rows × 21 columns

Now, let's see the distribution of the new feature so that any further transformation of the data is necessary.

In [0]:
X_train['Satisfactory Trades Percentage'].hist(bins=15, density=True, color='green', alpha=0.4, rwidth=0.95, figsize=(7,7))
X_train['Satisfactory Trades Percentage'].plot(kind='density', color='green')
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9061c38ac8>

It seems that most of the data instances have high percentage of satisfactory trades. That is, unless the data instance has extremely low value of trade percentage, newly introduced data is assumed to be not effective in predicting the risk flag.

Feature Selection

Roughly 10 features will be selected based on three standards:

(1) Model-free feature selection

  1. Univariate feature selection

(2) Model based feature selection

  1. LassoCV
  2. Permutation Feature Importance

(3) Correlation and outliers

  1. Highly correlated features will not be included. (Only few will be selected)
  2. Feature with too many outliers may lead a misleading result. Therefore these features may be filtered out.

Univariate feature selection

Univariate feature selection algorithm selects the feature based on univariate statistical tests between each of the feature and response. Espeically for univariate feature selection, 'SelectKBest' with chi-squared test will be used.

In [0]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

K_best_classifier = SelectKBest(chi2, k=10).fit(X_train, y_train)
mask = K_best_classifier.get_support(indices = True)
features_by_univariate = X_train.columns[mask]

X_train_univariate = pd.DataFrame(K_best_classifier.transform(X_train), columns = features_by_univariate)
X_train_univariate.head()

# pd.DataFrame(X_train_univariate, columns = X_train.columns[mask])
Out[0]:
Consolidated version of risk markers Months Since Oldest Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Percent Trades Never Delinquent Percent Installment Trades Net Fraction Revolving Burden Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance
0 65.0 101.0 34.0 15.0 0.0 100.0 33.0 72.0 4.000000 69.0
1 61.0 149.0 75.0 4.0 4.0 100.0 50.0 30.0 1.082584 100.0
2 56.0 169.0 56.0 7.0 2.0 80.0 20.0 62.0 1.000000 83.0
3 87.0 99.0 99.0 24.0 0.0 100.0 29.0 11.0 0.000000 38.0
4 88.0 146.0 56.0 10.0 0.0 100.0 10.0 16.0 0.000000 17.0

Logistic Lasso

LassoCV will provide the interpretable feature selection by providing how the coefficients of features can be drawn differently as the constaint parameter changes. It is also important to note to standardize the data before actually applying the Logistic Lasso.

In [0]:
#standardize the features
from sklearn.preprocessing import StandardScaler

#train
scaler = StandardScaler()
X_train_scaled =  pd.DataFrame(scaler.fit_transform(X_train))
X_train_scaled.columns = X_train.columns

#test
scaler2 = StandardScaler()
X_test_scaled = pd.DataFrame(scaler2.fit_transform(X_test))
X_test_scaled.columns = X_test.columns


X_train_scaled.head()
Out[0]:
Consolidated version of risk markers Months Since Oldest Trade Open Months Since Most Recent Trade Open Average Months in File Number Satisfactory Trades Number Trades 60+ Ever Number Trades 90+ Ever Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever Number of Total Trades Number of Trades Open in Last 12 Months Percent Installment Trades Number of Inq Last 6 Months Number of Inq Last 6 Months excl 7days Net Fraction Revolving Burden Number Revolving Trades with Balance Number Installment Trades with Balance Number Bank/Natl Trades w high utilization ratio Percent Trades with Balance Satisfactory Trades Percentage
0 -0.731554 -1.040239 -0.601480 -1.346591 -0.554421 -0.480501 -0.397581 0.667216 0.721058 0.834695 -0.367485 1.780111 -0.091359 3.823627 3.927610 1.345289 0.339408 0.351449 2.053971 0.119847 -4.228582e-01
1 -1.149268 -0.532036 0.277503 -0.107434 -1.557472 2.804865 3.702199 0.667216 0.721058 0.834695 -1.162047 -1.019127 0.885229 -0.732293 -0.715731 -0.163168 -0.354333 -0.318985 0.000000 1.574681 -2.881845e+00
2 -1.671410 -0.320284 -0.681388 -0.681677 -1.283913 1.162182 1.652309 -1.070517 0.108996 -1.880449 0.268165 1.220263 -0.838161 0.786347 0.832049 0.986133 -0.007462 0.000000 -0.058142 0.776868 -4.584221e+00
3 1.565871 -1.061415 2.195284 0.617927 0.266258 -0.480501 -0.397581 0.667216 0.721058 0.834695 -1.559328 -1.019127 -0.321144 -0.732293 -0.715731 -0.845565 -1.048073 -0.318985 -0.762180 -1.334987 -8.190073e-16
4 1.670300 -0.563799 0.517226 -0.681677 -1.010353 -0.480501 -0.397581 0.667216 0.721058 0.834695 -0.764766 -1.019127 -1.412625 -0.732293 -0.715731 -0.665987 -1.048073 0.000000 -0.762180 -2.320520 -8.957403e-01
In [0]:
from sklearn import linear_model
cv = 10
lassocv = linear_model.LassoCV(cv=cv, max_iter=5e4)
lassocv.fit(X_train_scaled, y_train)
print('alpha = %.2e' %lassocv.alpha_)
alpha = 2.26e-04
In [0]:
eps = 5e-5  # the smaller it is the longer is the path
alphas_lasso, coefs_lasso, _ = linear_model.lasso_path(X_train_scaled, y_train, eps, fit_intercept=False)

plt.figure(figsize = (20,15))

NUM_COLORS = 21
LINE_STYLES = ['solid', 'dashed', 'dashdot', 'dotted']
NUM_STYLES = len(LINE_STYLES)
cm = plt.get_cmap('gist_rainbow')

for i in range(X_train_scaled.shape[1]):    
    lines = plt.plot(alphas_lasso, coefs_lasso[i], label = X_train_scaled.columns[i])
    lines[0].set_color(cm(i//NUM_STYLES*float(NUM_STYLES)/NUM_COLORS))
    lines[0].set_linestyle(LINE_STYLES[i%NUM_STYLES])
    
plt.xscale('log')
plt.xlabel('Log($\\lambda$)')
plt.ylabel('coefficients')
plt.title('Lasso paths')
plt.legend()
plt.axis('tight')
Out[0]:
(6.873732545688046e-06,
 0.3701055733710027,
 -0.1743293412314664,
 0.186481846457223)

It can be seen that as the constraint parameter (lambda) decreases, coefficients of different features start to have meaningful values of coefficients.

Notable feature is "Consolidated version of risk markers" which is re-confirmed as the most important feature as in the previous correlation analysis. It can be observed that each time the new feature starts to have its effect as lambda changes, "Consolidated version of risk markers" change trend in its graph. It can be inferred that when the feature that is highly correlated with this feature, the feature may change because newly introduced variable also explains the response that has been previously explained by Consolidated version of risk markers. The graph seems to show the most drastic change at the moment when "Number Bank/Nati Trades with high utilization ratio" and "Percent Trades Never Delinquent" are introduced. This can be explained by the high correlation between these two features and "Consolidated version of risk markers" (See Correlation heatmap above - these two features are most highly correlated).

Furthermore, there is a wide U-shaped curve when the log(lambda) value is 10E-3 generated by two dotted light blue lines. These two features are "Number of Inq Last 6 Months", "Number of Inq Last 6 Months excl 7 days". Intuitively, these two are directly correlated and can be checked in the correlation heatmap as well. Therefore, only one feature, "Number of Inq Last 6 Months" will be selected.

Another interesting observation is that "Percent Trades with Balance" begins to have value and then moves back to value of 0 at the moment "Max Delinquency Ever" begins to have effect, and after a while it begins to have effect again. (ADD MORE DETAIL)

According to the analysis of the graph, features are selected in descending order of importance as follows:

In [0]:
features_by_lasso = ["Consolidated version of risk markers", "Net Fraction Revolving Burden", "Number Satisfactory Trades", 
                     "Average Months in File", "Number of Inq Last 6 Months", "Number Bank/Natl Trades w high utilization ratio",
                     "Months Since Oldest Trade Open", "Percent Trades Never Delinquent", "Max Delq/Public Records Last 12 Months",
                     "Max Delinquency Ever"]

X_train_lasso = X_train[features_by_lasso]
X_test_lasso = X_test[features_by_lasso]
X_train_lasso.head()
Out[0]:
Consolidated version of risk markers Net Fraction Revolving Burden Number Satisfactory Trades Average Months in File Number of Inq Last 6 Months Number Bank/Natl Trades w high utilization ratio Months Since Oldest Trade Open Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever
0 65.0 72.0 15.0 34.0 9.0 4.000000 101.0 100.0 7.0 8.0
1 61.0 30.0 4.0 75.0 0.0 1.082584 149.0 100.0 7.0 8.0
2 56.0 62.0 7.0 56.0 3.0 1.000000 169.0 80.0 6.0 3.0
3 87.0 11.0 24.0 99.0 0.0 0.000000 99.0 100.0 7.0 8.0
4 88.0 16.0 10.0 56.0 0.0 0.000000 146.0 100.0 7.0 8.0

Permutation Feature Importance

Permutation feature importance can be evaluated by the metric score variation caused by the feature value permutation. XGBoost model will be used to further leverage on permutation object.

In [0]:
# fit a XGBoost
import xgboost as xgb

xgb_clf = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05)
xgb_clf.fit(X_train, y_train)
Out[0]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.05, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
In [0]:
from sklearn.metrics import accuracy_score

y_pred = xgb_clf.predict(X_train)
print("training accuracy by xgboost", accuracy_score(y_pred, y_train))
training accuracy by xgboost 0.7454284689852994
In [0]:
features
Out[0]:
Index(['Consolidated version of risk markers',
       'Months Since Oldest Trade Open', 'Months Since Most Recent Trade Open',
       'Average Months in File', 'Number Satisfactory Trades',
       'Number Trades 60+ Ever', 'Number Trades 90+ Ever',
       'Percent Trades Never Delinquent',
       'Max Delq/Public Records Last 12 Months', 'Max Delinquency Ever',
       'Number of Total Trades', 'Number of Trades Open in Last 12 Months',
       'Percent Installment Trades', 'Number of Inq Last 6 Months',
       'Number of Inq Last 6 Months excl 7days',
       'Net Fraction Revolving Burden', 'Number Revolving Trades with Balance',
       'Number Installment Trades with Balance',
       'Number Bank/Natl Trades w high utilization ratio',
       'Percent Trades with Balance', 'Satisfactory Trades Percentage'],
      dtype='object')
In [0]:
 !pip install eli5

import eli5
from eli5.sklearn import PermutationImportance

# define a permutation importance object
perm = PermutationImportance(xgb_clf).fit(X_train, y_train)
# show the importance
eli5.show_weights(perm, feature_names= np.array(features) )
Requirement already satisfied: eli5 in /usr/local/lib/python3.6/dist-packages (0.10.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from eli5) (1.12.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from eli5) (1.2.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from eli5) (0.10.1)
Requirement already satisfied: attrs>16.0.0 in /usr/local/lib/python3.6/dist-packages (from eli5) (19.3.0)
Requirement already satisfied: scikit-learn>=0.18 in /usr/local/lib/python3.6/dist-packages (from eli5) (0.21.3)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.6/dist-packages (from eli5) (0.8.6)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.6/dist-packages (from eli5) (2.10.3)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from eli5) (1.17.4)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.18->eli5) (0.14.1)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2->eli5) (1.1.1)
Using TensorFlow backend.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x.
We recommend you upgrade now or ensure your notebook will continue to use TensorFlow 1.x via the %tensorflow_version 1.x magic: more info.

Out[0]:
Weight Feature
0.0719 ± 0.0093 Consolidated version of risk markers
0.0142 ± 0.0072 Number Satisfactory Trades
0.0123 ± 0.0032 Average Months in File
0.0051 ± 0.0044 Percent Trades Never Delinquent
0.0039 ± 0.0030 Net Fraction Revolving Burden
0.0038 ± 0.0017 Months Since Oldest Trade Open
0.0037 ± 0.0021 Percent Installment Trades
0.0026 ± 0.0012 Percent Trades with Balance
0.0025 ± 0.0033 Number Revolving Trades with Balance
0.0019 ± 0.0026 Max Delq/Public Records Last 12 Months
0.0018 ± 0.0016 Number of Total Trades
0.0015 ± 0.0009 Months Since Most Recent Trade Open
0.0012 ± 0.0010 Number of Trades Open in Last 12 Months
0.0011 ± 0.0019 Satisfactory Trades Percentage
0.0007 ± 0.0018 Number Bank/Natl Trades w high utilization ratio
0.0006 ± 0.0014 Number Installment Trades with Balance
0.0006 ± 0.0008 Number Trades 90+ Ever
0.0005 ± 0.0007 Number of Inq Last 6 Months excl 7days
0.0005 ± 0.0002 Number Trades 60+ Ever
-0.0001 ± 0.0005 Max Delinquency Ever
… 1 more …
In [0]:
# importance in decreasing order
imp_ord = np.argsort(perm.feature_importances_)

plt.figure(figsize=(9,8))
yaxis = np.arange(len(perm.feature_importances_))*1.2
plt.barh(y = yaxis,width = perm.feature_importances_[imp_ord])
plt.yticks(yaxis,np.array(features)[imp_ord])
plt.ylabel('Feature')
plt.xlabel('Importance')
plt.show()

It can be observed that the feature rankings are different from other two methods. Important features selected by tree-based selection include features as follows:

In [0]:
features_by_tree = ["Consolidated version of risk markers", "Number Satisfactory Trades", "Average Months in File",
                    "Percent Trades Never Delinquent", "Net Fraction Revolving Burden", "Months Since Oldest Trade Open",
                    "Percent Installment Trades", "Percent Trades with Balance", "Number Revolving Trades with Balance",
                    "Max Delq/Public Records Last 12 Months"]

X_train_tree = X_train[features_by_tree]
X_test_tree = X_test[features_by_tree]
X_train_tree.head()
Out[0]:
Consolidated version of risk markers Number Satisfactory Trades Average Months in File Percent Trades Never Delinquent Net Fraction Revolving Burden Months Since Oldest Trade Open Percent Installment Trades Percent Trades with Balance Number Revolving Trades with Balance Max Delq/Public Records Last 12 Months
0 65.0 15.0 34.0 100.0 72.0 101.0 33.0 69.0 5.0 7.0
1 61.0 4.0 75.0 100.0 30.0 149.0 50.0 100.0 3.0 7.0
2 56.0 7.0 56.0 80.0 62.0 169.0 20.0 83.0 4.0 6.0
3 87.0 24.0 99.0 100.0 11.0 99.0 29.0 38.0 1.0 7.0
4 88.0 10.0 56.0 100.0 16.0 146.0 10.0 17.0 1.0 7.0

Model

Goal of this project is to achieve a high prediction accuracy and interpretable model at the same time. Therefore, both white-box models and black-box models will be used and provide various interpretations according to the results of fitted models. For white-box models, visualization and model-diagnostic methods wil be discussed. For black-box models, model agnostic post hoc methods such as VI, PDP, ICE, and SHAP will be used.

White-box model includes: GAM

Black-box models include: Gradient Boosting, MLP

Post hoc methods:

Variable Importance: for the forest based fitted models, variable importance will be discussed for the global interpretation.

Partial Dependency Plot (PDP) / Individual Conditional Expectation (ICE): Considering that the features are not highly correlated, dependency plots will provide meaningful insights of features globally.

SHAP: For both global and local interpretation, SHAP will be leveraged to explain how the changes in feature affect the prediction. SHAP will be fully leveraged especially for Deep Neural Network model.

In [0]:
!pip install shap
Requirement already satisfied: shap in /usr/local/lib/python3.6/dist-packages (0.33.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from shap) (1.2.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from shap) (0.25.3)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.6/dist-packages (from shap) (4.28.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from shap) (1.17.4)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from shap) (0.21.3)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2.6.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->shap) (0.14.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->shap) (1.12.0)
In [0]:
import shap
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

White-box Models

Logistic GAM

LogisticGAM will be used as main model along with some feature engineering techniques such as linear splines and B-splines with degree 3.

In [0]:
!pip install pygam
Requirement already satisfied: pygam in /usr/local/lib/python3.6/dist-packages (0.8.0)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from pygam) (0.16.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from pygam) (1.2.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pygam) (1.17.4)
Requirement already satisfied: progressbar2 in /usr/local/lib/python3.6/dist-packages (from pygam) (3.38.0)
Requirement already satisfied: python-utils>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from progressbar2->pygam) (2.3.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from progressbar2->pygam) (1.12.0)
In [0]:
from pygam import LogisticGAM, f, s, l

1. Linear Splines

In [0]:
logit_GAM_linear_spline = LogisticGAM(spline_order = 1)
logit_GAM_linear_spline.gridsearch(X_train_lasso.values, y_train)
100% (11 of 11) |########################| Elapsed Time: 0:00:16 Time:  0:00:16
Out[0]:
LogisticGAM(callbacks=[Deviance(), Diffs(), Accuracy()], 
   fit_intercept=True, max_iter=100, 
   terms=s(0) + s(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9) + intercept,
   tol=0.0001, verbose=False)

2. B-splines

In [0]:
logit_GAM_b_spline = LogisticGAM() 

# gridsearch to tune the 5 penalty strengths in the current GAM model
logit_GAM_b_spline.gridsearch(X_train_lasso.values, y_train)
100% (11 of 11) |########################| Elapsed Time: 0:00:18 Time:  0:00:18
Out[0]:
LogisticGAM(callbacks=[Deviance(), Diffs(), Accuracy()], 
   fit_intercept=True, max_iter=100, 
   terms=s(0) + s(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9) + intercept,
   tol=0.0001, verbose=False)
In [0]:
print('GAM_linear training accuracy: ', accuracy_score(y_train, logit_GAM_linear_spline.predict(X_train_lasso)))
print('GAM_linear test accuracy: ', accuracy_score(y_test, logit_GAM_linear_spline.predict(X_test_lasso)))

print('GAM_B_spline training accuracy: ', accuracy_score(y_train, logit_GAM_b_spline.predict(X_train_lasso)))
print('GAM_B_spline test accuracy: ', accuracy_score(y_test, logit_GAM_b_spline.predict(X_test_lasso)))
GAM_linear training accuracy:  0.7182980757738735
GAM_linear test accuracy:  0.7136711281070746
GAM_B_spline training accuracy:  0.7200908330345405
GAM_B_spline test accuracy:  0.7189292543021033

Higher accuracy of B-spline may be due to non-linearity of the features on the label. Therefore, it will be worth discussing more on B-Spline model than on linear spline.

Model Summary

In [0]:
logit_GAM_b_spline.summary()
LogisticGAM                                                                                               
=============================================== ==========================================================
Distribution:                      BinomialDist Effective DoF:                                     27.9927
Link Function:                        LogitLink Log Likelihood:                                 -4615.2164
Number of Samples:                         8367 AIC:                                             9286.4181
                                                AICc:                                            9286.6267
                                                UBRE:                                               3.1126
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.2032
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [251.1886]           20           4.4          0.00e+00     ***         
s(1)                              [251.1886]           20           2.5          1.15e-05     ***         
s(2)                              [251.1886]           20           3.5          8.51e-13     ***         
s(3)                              [251.1886]           20           2.7          1.29e-08     ***         
s(4)                              [251.1886]           20           1.5          5.03e-08     ***         
s(5)                              [251.1886]           20           2.3          2.44e-09     ***         
s(6)                              [251.1886]           20           2.6          1.32e-01                 
s(7)                              [251.1886]           20           3.1          1.32e-06     ***         
s(8)                              [251.1886]           20           2.8          5.73e-05     ***         
s(9)                              [251.1886]           20           2.6          7.72e-01                 
intercept                                              1            0.0          1.36e-07     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.
KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. 
 
Please do not make inferences based on these values! 

Collaborate on a solution, and stay up to date at: 
github.com/dswah/pyGAM/issues/163 

It can be easily found that first feature (Consolidated version of risk markers) has the lowest p-value. This implies that Consolidated version of risk markers is the most significant among all as expected. Importance of the features can be implied based on these p-values.

Furthermore, pseudo R-squared seems to be in good range that the model is well fitted (see https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation).

Partial Dependency Plot (PDP)

In [0]:
# partial dependence plot
fig, axs = plt.subplots(2,5,figsize=(20,7))
for i, ax in enumerate(axs.flatten()):
    XX = logit_GAM_b_spline.generate_X_grid(term=i)
    plt.subplot(ax)
    plt.plot(XX[:,i], logit_GAM_b_spline.partial_dependence(term=i, X=XX))
    plt.plot(XX[:,i], logit_GAM_b_spline.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    plt.title(X_train_lasso.columns[i])
plt.tight_layout()

The larger the red line (confidence interval of feature) is, the wider the range that we are confident that the sample mean lies in that range. In other words, for wider the confidence interval is, the more ranges that we have to check whether the sample mean actually lies in that range. For example, Consolidated version of risk markers have much narrower confidence interval than Max Delinquency Ever. That is, the sample mean lies in the narrower range of values of Consolidate version of risk markers than Max Delinquency Ever in 95% confidence level. This may be due to the fact that Max Delinquency Ever contains much more outliers than Consolidated version of risk markers (see boxplot above), thereby making it hard to confidently narrow down the possible range which the actual sample mean lies in.

Black-box Models

Gradient Boosting

Gradient Boosting will be trained with the features selected by permutation importance with xgboost. For global interpretation methods, Variable Importance (VI) and Partial Dependency Plots (PDP) will be used to explain how overall feature variations affect prediction. For local interpretation, SHAP will be used to explain how local feature change attributes to prediction.

Attempt on fitting gradient boosting with default setting

In [0]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
from matplotlib.legend_handler import HandlerLine2D
In [0]:
gradient_boosting_classifier = GradientBoostingClassifier()

gradient_boosting_classifier.fit(X_train_tree, y_train)

print('gradient boosting training accuracy: ', accuracy_score(y_train, gradient_boosting_classifier.predict(X_train_tree)))

y_pred = gradient_boosting_classifier.predict(X_test_tree)
print('gradient boosting testing accuracy: ', accuracy_score(y_test, y_pred))
gradient boosting training accuracy:  0.7406477829568543
gradient boosting testing accuracy:  0.7088910133843213

Model Evaluation

Since the task is binary classification, AUC is a good option to evaluate the model accuracy. By fitting the model with different parameter values, change in auc curve will be used to decide the optimal values for each parameter. After selecting rough range of parameters, the model will be further tuned with seqMML to find the best combinations of these parameters.

In [0]:
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print(roc_auc)

plot_roc_curve(false_positive_rate, true_positive_rate)
0.7058727016077289

Hyperparameter tuning
Plotting learning rate and min samples leaf graph to find overfitting limit

In [0]:
hyperparamsdict = {
    'learning_rate': [1, 0.5, 0.3, 0.25, 0.1, 0.05, 0.01],
    'min_samples_leaf': range(20,71,10),
}

for key in hyperparamsdict:
  train_results = []
  test_results = []
  for value in hyperparamsdict.get(key):
    temp_model = GradientBoostingClassifier(**{key:value})
    temp_model.fit(X_train_tree, y_train)
    
    y_train_pred = temp_model.predict(X_train_tree)   
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, y_train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)

    y_pred = temp_model.predict(X_test_tree)   
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

  line1, = plt.plot(hyperparamsdict.get(key), train_results, 'b', label='Train AUC')
  line2, = plt.plot(hyperparamsdict.get(key), test_results, 'r', label='Test AUC')
  plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
  plt.ylabel('AUC score')
  plt.xlabel(key)
  plt.show()

Set initial learning rate and find optimal number of tree
learning rate was initially set to be 0.25 according to the learning rate graph. However, number of trees found to be 20, lower limit of search range. Therefore, learning rate is lowered.

In [0]:
gb = GradientBoostingClassifier(learning_rate=0.05, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=3612)

parameter_space = {
    'n_estimators': range(20,81,10)
}

gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)

# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
0.716 (+/-0.009) for {'n_estimators': 20}
0.718 (+/-0.010) for {'n_estimators': 30}
0.719 (+/-0.009) for {'n_estimators': 40}
0.718 (+/-0.010) for {'n_estimators': 50}
0.718 (+/-0.010) for {'n_estimators': 60}
0.717 (+/-0.013) for {'n_estimators': 70}
0.719 (+/-0.009) for {'n_estimators': 80}
Best parameters found: {'n_estimators': 40}
Best score found: 0.7191346958288515
Accuracy: 0.705545
AUC score: 0.7024400080661424

Optimal n_estimators found to be 40, an acceptable number.

Testing max depth and min samples split next.

In [0]:
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, min_samples_leaf=50,max_features='sqrt',subsample=0.8,random_state=3612)

parameter_space = {
    'max_depth':range(3,20,2), 
    'min_samples_split':range(100,551,50)
}

gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)

# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
print('Accuracy:', accuracy_score(y_test, gb_clf.predict(X_test_tree)).round(6))
0.715 (+/-0.004) for {'max_depth': 3, 'min_samples_split': 100}
0.716 (+/-0.004) for {'max_depth': 3, 'min_samples_split': 150}
0.715 (+/-0.007) for {'max_depth': 3, 'min_samples_split': 200}
0.715 (+/-0.004) for {'max_depth': 3, 'min_samples_split': 250}
0.716 (+/-0.004) for {'max_depth': 3, 'min_samples_split': 300}
0.715 (+/-0.004) for {'max_depth': 3, 'min_samples_split': 350}
0.715 (+/-0.003) for {'max_depth': 3, 'min_samples_split': 400}
0.715 (+/-0.004) for {'max_depth': 3, 'min_samples_split': 450}
0.713 (+/-0.002) for {'max_depth': 3, 'min_samples_split': 500}
0.714 (+/-0.005) for {'max_depth': 3, 'min_samples_split': 550}
0.718 (+/-0.006) for {'max_depth': 5, 'min_samples_split': 100}
0.719 (+/-0.009) for {'max_depth': 5, 'min_samples_split': 150}
0.718 (+/-0.012) for {'max_depth': 5, 'min_samples_split': 200}
0.717 (+/-0.006) for {'max_depth': 5, 'min_samples_split': 250}
0.719 (+/-0.006) for {'max_depth': 5, 'min_samples_split': 300}
0.718 (+/-0.013) for {'max_depth': 5, 'min_samples_split': 350}
0.718 (+/-0.009) for {'max_depth': 5, 'min_samples_split': 400}
0.718 (+/-0.006) for {'max_depth': 5, 'min_samples_split': 450}
0.717 (+/-0.008) for {'max_depth': 5, 'min_samples_split': 500}
0.717 (+/-0.007) for {'max_depth': 5, 'min_samples_split': 550}
0.718 (+/-0.009) for {'max_depth': 7, 'min_samples_split': 100}
0.719 (+/-0.009) for {'max_depth': 7, 'min_samples_split': 150}
0.717 (+/-0.012) for {'max_depth': 7, 'min_samples_split': 200}
0.719 (+/-0.009) for {'max_depth': 7, 'min_samples_split': 250}
0.718 (+/-0.009) for {'max_depth': 7, 'min_samples_split': 300}
0.719 (+/-0.009) for {'max_depth': 7, 'min_samples_split': 350}
0.718 (+/-0.009) for {'max_depth': 7, 'min_samples_split': 400}
0.718 (+/-0.008) for {'max_depth': 7, 'min_samples_split': 450}
0.718 (+/-0.006) for {'max_depth': 7, 'min_samples_split': 500}
0.719 (+/-0.005) for {'max_depth': 7, 'min_samples_split': 550}
0.717 (+/-0.011) for {'max_depth': 9, 'min_samples_split': 100}
0.719 (+/-0.012) for {'max_depth': 9, 'min_samples_split': 150}
0.718 (+/-0.010) for {'max_depth': 9, 'min_samples_split': 200}
0.718 (+/-0.011) for {'max_depth': 9, 'min_samples_split': 250}
0.717 (+/-0.010) for {'max_depth': 9, 'min_samples_split': 300}
0.717 (+/-0.012) for {'max_depth': 9, 'min_samples_split': 350}
0.716 (+/-0.010) for {'max_depth': 9, 'min_samples_split': 400}
0.717 (+/-0.008) for {'max_depth': 9, 'min_samples_split': 450}
0.719 (+/-0.011) for {'max_depth': 9, 'min_samples_split': 500}
0.717 (+/-0.006) for {'max_depth': 9, 'min_samples_split': 550}
0.718 (+/-0.007) for {'max_depth': 11, 'min_samples_split': 100}
0.718 (+/-0.016) for {'max_depth': 11, 'min_samples_split': 150}
0.718 (+/-0.011) for {'max_depth': 11, 'min_samples_split': 200}
0.717 (+/-0.010) for {'max_depth': 11, 'min_samples_split': 250}
0.717 (+/-0.011) for {'max_depth': 11, 'min_samples_split': 300}
0.716 (+/-0.010) for {'max_depth': 11, 'min_samples_split': 350}
0.717 (+/-0.010) for {'max_depth': 11, 'min_samples_split': 400}
0.718 (+/-0.007) for {'max_depth': 11, 'min_samples_split': 450}
0.719 (+/-0.010) for {'max_depth': 11, 'min_samples_split': 500}
0.720 (+/-0.009) for {'max_depth': 11, 'min_samples_split': 550}
0.716 (+/-0.008) for {'max_depth': 13, 'min_samples_split': 100}
0.719 (+/-0.012) for {'max_depth': 13, 'min_samples_split': 150}
0.719 (+/-0.015) for {'max_depth': 13, 'min_samples_split': 200}
0.718 (+/-0.002) for {'max_depth': 13, 'min_samples_split': 250}
0.717 (+/-0.011) for {'max_depth': 13, 'min_samples_split': 300}
0.720 (+/-0.006) for {'max_depth': 13, 'min_samples_split': 350}
0.718 (+/-0.011) for {'max_depth': 13, 'min_samples_split': 400}
0.720 (+/-0.008) for {'max_depth': 13, 'min_samples_split': 450}
0.719 (+/-0.010) for {'max_depth': 13, 'min_samples_split': 500}
0.719 (+/-0.008) for {'max_depth': 13, 'min_samples_split': 550}
0.715 (+/-0.007) for {'max_depth': 15, 'min_samples_split': 100}
0.718 (+/-0.010) for {'max_depth': 15, 'min_samples_split': 150}
0.720 (+/-0.016) for {'max_depth': 15, 'min_samples_split': 200}
0.717 (+/-0.007) for {'max_depth': 15, 'min_samples_split': 250}
0.717 (+/-0.011) for {'max_depth': 15, 'min_samples_split': 300}
0.720 (+/-0.007) for {'max_depth': 15, 'min_samples_split': 350}
0.718 (+/-0.010) for {'max_depth': 15, 'min_samples_split': 400}
0.719 (+/-0.010) for {'max_depth': 15, 'min_samples_split': 450}
0.719 (+/-0.010) for {'max_depth': 15, 'min_samples_split': 500}
0.719 (+/-0.007) for {'max_depth': 15, 'min_samples_split': 550}
0.717 (+/-0.008) for {'max_depth': 17, 'min_samples_split': 100}
0.719 (+/-0.010) for {'max_depth': 17, 'min_samples_split': 150}
0.720 (+/-0.016) for {'max_depth': 17, 'min_samples_split': 200}
0.717 (+/-0.007) for {'max_depth': 17, 'min_samples_split': 250}
0.718 (+/-0.011) for {'max_depth': 17, 'min_samples_split': 300}
0.720 (+/-0.007) for {'max_depth': 17, 'min_samples_split': 350}
0.718 (+/-0.010) for {'max_depth': 17, 'min_samples_split': 400}
0.719 (+/-0.010) for {'max_depth': 17, 'min_samples_split': 450}
0.719 (+/-0.010) for {'max_depth': 17, 'min_samples_split': 500}
0.719 (+/-0.007) for {'max_depth': 17, 'min_samples_split': 550}
0.716 (+/-0.007) for {'max_depth': 19, 'min_samples_split': 100}
0.719 (+/-0.010) for {'max_depth': 19, 'min_samples_split': 150}
0.720 (+/-0.016) for {'max_depth': 19, 'min_samples_split': 200}
0.717 (+/-0.007) for {'max_depth': 19, 'min_samples_split': 250}
0.717 (+/-0.013) for {'max_depth': 19, 'min_samples_split': 300}
0.720 (+/-0.007) for {'max_depth': 19, 'min_samples_split': 350}
0.718 (+/-0.010) for {'max_depth': 19, 'min_samples_split': 400}
0.719 (+/-0.010) for {'max_depth': 19, 'min_samples_split': 450}
0.719 (+/-0.010) for {'max_depth': 19, 'min_samples_split': 500}
0.719 (+/-0.007) for {'max_depth': 19, 'min_samples_split': 550}
Best parameters found: {'max_depth': 13, 'min_samples_split': 350}
Best score found: 0.7204493844866738
Accuracy: 0.706979

Max depth and min samples split found to be 13 and 350.
Min samples split is logical given the data size, whereas max depth may be a bit large given there are only 10 features.
High max depth is found to lead to overfitting, but it is not shown at this stage.

Testing min samples leaf mainly while allowing some variation to min samples split with above result as reference.

In [0]:
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, max_depth=13,max_features='sqrt',subsample=0.8,random_state=3612)

parameter_space = {
    'min_samples_split': range(250,451,50),
    'min_samples_leaf': range(30,71,10),
}

gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)

# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
print('Accuracy:', accuracy_score(y_test, gb_clf.predict(X_test_tree)).round(6))
0.719 (+/-0.015) for {'min_samples_leaf': 30, 'min_samples_split': 250}
0.718 (+/-0.009) for {'min_samples_leaf': 30, 'min_samples_split': 300}
0.717 (+/-0.007) for {'min_samples_leaf': 30, 'min_samples_split': 350}
0.719 (+/-0.007) for {'min_samples_leaf': 30, 'min_samples_split': 400}
0.718 (+/-0.009) for {'min_samples_leaf': 30, 'min_samples_split': 450}
0.717 (+/-0.008) for {'min_samples_leaf': 40, 'min_samples_split': 250}
0.718 (+/-0.010) for {'min_samples_leaf': 40, 'min_samples_split': 300}
0.718 (+/-0.011) for {'min_samples_leaf': 40, 'min_samples_split': 350}
0.718 (+/-0.007) for {'min_samples_leaf': 40, 'min_samples_split': 400}
0.719 (+/-0.010) for {'min_samples_leaf': 40, 'min_samples_split': 450}
0.718 (+/-0.002) for {'min_samples_leaf': 50, 'min_samples_split': 250}
0.717 (+/-0.011) for {'min_samples_leaf': 50, 'min_samples_split': 300}
0.720 (+/-0.006) for {'min_samples_leaf': 50, 'min_samples_split': 350}
0.718 (+/-0.011) for {'min_samples_leaf': 50, 'min_samples_split': 400}
0.720 (+/-0.008) for {'min_samples_leaf': 50, 'min_samples_split': 450}
0.718 (+/-0.011) for {'min_samples_leaf': 60, 'min_samples_split': 250}
0.717 (+/-0.011) for {'min_samples_leaf': 60, 'min_samples_split': 300}
0.721 (+/-0.010) for {'min_samples_leaf': 60, 'min_samples_split': 350}
0.718 (+/-0.011) for {'min_samples_leaf': 60, 'min_samples_split': 400}
0.717 (+/-0.008) for {'min_samples_leaf': 60, 'min_samples_split': 450}
0.716 (+/-0.007) for {'min_samples_leaf': 70, 'min_samples_split': 250}
0.717 (+/-0.006) for {'min_samples_leaf': 70, 'min_samples_split': 300}
0.719 (+/-0.009) for {'min_samples_leaf': 70, 'min_samples_split': 350}
0.717 (+/-0.011) for {'min_samples_leaf': 70, 'min_samples_split': 400}
0.717 (+/-0.009) for {'min_samples_leaf': 70, 'min_samples_split': 450}
Best parameters found: {'min_samples_leaf': 60, 'min_samples_split': 350}
Best score found: 0.7211664873909406
Accuracy: 0.707935

Optimal min samples split stays the same at 350 while min samples leaf found to be 60.

Testing max feature next.

In [0]:
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, max_depth=13, min_samples_split=350, min_samples_leaf=60,subsample=0.8,random_state=3612)

parameter_space = {
    'max_features':range(1,10,1)
}

gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)

# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
print('Accuracy:', accuracy_score(y_test, gb_clf.predict(X_test_tree)).round(6))
0.715 (+/-0.013) for {'max_features': 1}
0.717 (+/-0.007) for {'max_features': 2}
0.721 (+/-0.010) for {'max_features': 3}
0.719 (+/-0.008) for {'max_features': 4}
0.717 (+/-0.007) for {'max_features': 5}
0.718 (+/-0.009) for {'max_features': 6}
0.716 (+/-0.010) for {'max_features': 7}
0.717 (+/-0.008) for {'max_features': 8}
0.717 (+/-0.013) for {'max_features': 9}
Best parameters found: {'max_features': 3}
Best score found: 0.7211664873909406
Accuracy: 0.707935

Max feature found to be 3, which agrees with common default values of square root of total number of features.

Finally, testing subsample size.

In [0]:
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=40, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, random_state=3612)

parameter_space = {
    'subsample': [0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
}

gb_clf = GridSearchCV(gb, parameter_space, n_jobs=-1, cv=5)
gb_clf.fit(X_train_tree,y_train)

# All results
means = gb_clf.cv_results_['mean_test_score']
stds = gb_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gb_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

print('Best parameters found:', gb_clf.best_params_)
print('Best score found:', gb_clf.best_score_)
y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
0.718 (+/-0.012) for {'subsample': 0.6}
0.718 (+/-0.009) for {'subsample': 0.65}
0.719 (+/-0.007) for {'subsample': 0.7}
0.719 (+/-0.012) for {'subsample': 0.75}
0.721 (+/-0.010) for {'subsample': 0.8}
0.720 (+/-0.011) for {'subsample': 0.85}
0.716 (+/-0.010) for {'subsample': 0.9}
0.719 (+/-0.010) for {'subsample': 0.95}
Best parameters found: {'subsample': 0.8}
Best score found: 0.7211664873909406
Accuracy: 0.707935
AUC score: 0.7048626005976277

Subsample size found to be 0.8, which again is a common default value for tuning gradient boosting.

Determining final learning rate and number of trees, this is determined by halving learning_rate while increasing n_estimators proportionally.

In [0]:
gb_clf = GradientBoostingClassifier(learning_rate=0.01, n_estimators=200, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)

gb_clf.fit(X_train_tree,y_train)

y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
Accuracy: 0.707935
AUC score: 0.704965260591395
In [0]:
gb_clf = GradientBoostingClassifier(learning_rate=0.005, n_estimators=400, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)

gb_clf.fit(X_train_tree,y_train)

y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
Accuracy: 0.708891
AUC score: 0.7057187116170783
In [0]:
gb_clf = GradientBoostingClassifier(learning_rate=0.0025, n_estimators=800, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)

gb_clf.fit(X_train_tree,y_train)

y_pred = gb_clf.predict(X_test_tree)
print('Accuracy:', accuracy_score(y_test, y_pred).round(6))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc_score = auc(false_positive_rate, true_positive_rate)
print('AUC score:', auc_score)
Accuracy: 0.706501
AUC score: 0.7034501090762435

Accuracy dropped at this point, choosing model using learning rate=0.005 with n_estimators=400 as final model.

Final gradient boosting model:

In [0]:
gradient_boosting_classifier = GradientBoostingClassifier(learning_rate=0.005, n_estimators=400, max_depth=13, min_samples_split=350, min_samples_leaf=60, max_features=3, subsample=0.8, random_state=3612)

gradient_boosting_classifier.fit(X_train_tree,y_train)

print('Training Accuracy:', accuracy_score(y_train, gradient_boosting_classifier.predict(X_train_tree)).round(6))
y_pred = gradient_boosting_classifier.predict(X_test_tree)
print('Test Accuracy:', accuracy_score(y_test, y_pred).round(6))

Both AUC score and test accuracy showed ~0.003 improvement. But they are still relatively low.
Test set accuracy is 0.7089

Model Interpretation

Variable Importance (VI)

In [0]:
importances = np.array(gradient_boosting_classifier.feature_importances_)
arg_sorted = np.argsort(importances)

yaxis = np.arange(len(importances)) 
plt.barh(y = yaxis, width =importances[arg_sorted])
plt.yticks(yaxis, X_train_tree.columns[arg_sorted])
plt.ylabel('Feature')
plt.xlabel('Importance')
plt.show()

Output does not seem to deviate too much from permutation importances of single tree (see above). Only the orders of Average Months in File and Number Satisfactory Trades are switched. This may be due to the correlation between these two features with the other features deleted after only 10 features are selected.

Partial Dependency Plot (PDP)

In [0]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.inspection import partial_dependence
from mpl_toolkits.mplot3d import Axes3D

# ignore warnings for using deprecated partial dependency plot
import warnings
In [0]:
warnings.filterwarnings('ignore')
fig, ax = plot_partial_dependence(gradient_boosting_classifier, 
                                  X_train_tree,
                                  features=X_train_tree.columns,
                                  feature_names = features_by_tree
                                  )

fig.set_figwidth(9.5)
fig.set_figheight(9.5)
fig.tight_layout()

Among all, variations in Consolidated version of risk markers and Number satisfactory Trades seem to affect the odds of risk flag the most. It is reconfirmed that Consolidated version of risk markers is one of the most (if not the most) influenctial factor for predition. Furthermore, It is worth noting that extremely high number of satisfactory trades highly affect the odds of risk flag while extremely low value does not tell much about the flag.

Percent Trades Never Delinquent seems to have an interesting plot. Trades never delinquent percentage lower than 55 does not affect the prediction, while that after 60 seems to positively correlated to the prediction. This may imply that if the percentage of never-delinquent trades is lower than 55%, the data instance are likely to have the bad risk flag regardless of how low it is.

Interaction Plots

After undergoing some interaction plots, Net Fraction Revolving Burden and Number of Inq Last 6 Months seem to have meaningful interactions.

In [0]:
fig = plt.figure(figsize=(10,7))

features_plot = ("Average Months in File", "Number Revolving Trades with Balance")
pdp, axes = partial_dependence(gradient_boosting_classifier, X_train_tree, [2,8])
XX, YY = np.meshgrid(axes[0], axes[1])
Z = pdp[0].T
ax = Axes3D(fig)
surf = ax.plot_surface(XX, YY, Z, rstride=1, cstride=1,
                       cmap=plt.cm.BuPu, edgecolor='k')
ax.set_xlabel(features_plot[0])
ax.set_ylabel(features_plot[1])
ax.set_zlabel('Partial dependence')

ax.view_init(elev=22, azim=122)
plt.colorbar(surf)
plt.suptitle('Partial dependence of Risk Flag \n'
             'with Gradient Boosting')
plt.subplots_adjust(top=0.9)

plt.show()

This graph cleary shows the interaction between seemingly uncorrelated features. When Number Revolving Trades with Balance is smaller then 20, higher values of Average Months in File seem to explain the odds of the dependence more than the lower values. However, It can be observed that when the Number Revolving Trades with Balance greater than 20, Average Months in File seem to drastically change its trend. Values smaller than 50 seem to attribute to the odds of the depence much more than higher values.

SHAP

In [0]:
shap.initjs()
# define the explainer
tree_explainer = shap.TreeExplainer(gradient_boosting_classifier)
# calculate the shape value
shap_values = tree_explainer.shap_values(X_train_tree)
In [0]:
shap.initjs()
shap.force_plot(tree_explainer.expected_value, shap_values, X_train_tree, feature_names=X_train_tree.columns)
Out[0]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Deep Neural Network

In [0]:
from tensorflow import keras
from tensorflow.keras import layers

# from keras.optimizers import Adam
from keras.losses import binary_crossentropy
from keras.activations import relu, elu, sigmoid

For better optimization, standardizing the training data will be helpful.

In [0]:
#scale the features selected by lasso
scaler2 = StandardScaler()
X_train_lasso_scaled = pd.DataFrame(scaler2.fit_transform(X_train_lasso))
X_train_lasso_scaled.columns = features_by_lasso

X_train_lasso_scaled.head()
Out[0]:
Consolidated version of risk markers Net Fraction Revolving Burden Number Satisfactory Trades Average Months in File Number of Inq Last 6 Months Number Bank/Natl Trades w high utilization ratio Months Since Oldest Trade Open Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever
0 -0.731554 1.345289 -0.554421 -1.346591 3.823627 2.053971 -1.040239 0.667216 0.721058 0.834695
1 -1.149268 -0.163168 -1.557472 -0.107434 -0.732293 0.000000 -0.532036 0.667216 0.721058 0.834695
2 -1.671410 0.986133 -1.283913 -0.681677 0.786347 -0.058142 -0.320284 -1.070517 0.108996 -1.880449
3 1.565871 -0.845565 0.266258 0.617927 -0.732293 -0.762180 -1.061415 0.667216 0.721058 0.834695
4 1.670300 -0.665987 -1.010353 -0.681677 -0.732293 -0.762180 -0.563799 0.667216 0.721058 0.834695
In [0]:
#get subset of text X
X_test_lasso = X_test[features_by_lasso]

scaler2_1 = StandardScaler()

X_test_lasso_scaled = pd.DataFrame(scaler2_1.fit_transform(X_test_lasso))
X_test_lasso_scaled.columns = features_by_lasso

X_test_lasso_scaled.head()
Out[0]:
Consolidated version of risk markers Net Fraction Revolving Burden Number Satisfactory Trades Average Months in File Number of Inq Last 6 Months Number Bank/Natl Trades w high utilization ratio Months Since Oldest Trade Open Percent Trades Never Delinquent Max Delq/Public Records Last 12 Months Max Delinquency Ever
0 -0.861127 0.746455 0.990188 -8.964956e-01 0.208119 1.204539 -0.549313 0.133122 -3.706824 -2.476138
1 -0.237289 0.171193 -0.177041 -3.829377e-01 -0.613908 -0.729376 -0.668147 -0.854320 0.094177 -0.817274
2 0.000000 -0.188345 -0.087255 -4.292998e-16 0.000000 0.000000 -0.138797 0.000000 0.727677 0.841591
3 0.386549 -0.655746 0.271893 1.306202e-01 1.852173 -0.084738 0.109673 0.671727 0.727677 0.841591
4 -1.588938 -0.188345 -1.164697 8.254339e-01 -0.613908 0.000000 -0.354858 -0.315715 -3.706824 -0.264319

Keras Sequential with initial setting as baseline

In [0]:
multi_layer_classifier = keras.Sequential()
multi_layer_classifier.add(layers.Dense(10, input_dim =10, activation = 'relu'))
multi_layer_classifier.add(layers.Dense(50, activation = 'relu'))
multi_layer_classifier.add(layers.Dense(1, activation = 'sigmoid'))

multi_layer_classifier.compile(optimizer=keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

multi_layer_classifier.fit(X_train_lasso_scaled, y_train, epochs=150, batch_size=10,)

MLP_training_score = multi_layer_classifier.evaluate(X_train_lasso_scaled, y_train, batch_size=10)[1]
MLP_test_score = multi_layer_classifier.evaluate(X_test_lasso_scaled, y_test, batch_size=10)[1]

print('MLP training accuracy: ', MLP_training_score)
print('MLP test accuracy: ', MLP_test_score)

This section makes use of Keras Tuner to tune the model

In [0]:
!git clone https://github.com/keras-team/keras-tuner.git
!pip install ./keras-tuner
fatal: destination path 'keras-tuner' already exists and is not an empty directory.
Processing ./keras-tuner
Collecting tensorflow>=2.0.0-beta1
  Using cached https://files.pythonhosted.org/packages/d5/97/fbec42dfdb93a37ec971ca0996ff70b8eb5817789a9c1880aafd4684c9af/tensorflow-2.1.0rc1-cp36-cp36m-manylinux2010_x86_64.whl
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (0.16.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (1.17.4)
Requirement already satisfied: tabulate in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (0.8.6)
Requirement already satisfied: terminaltables in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (3.1.0)
Requirement already satisfied: colorama in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (0.4.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (4.28.1)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (2.21.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (1.2.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from keras-tuner==1.0.0) (0.21.3)
Requirement already satisfied: keras-applications>=1.0.8 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.0.8)
Requirement already satisfied: wrapt>=1.11.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.11.2)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.33.6)
Requirement already satisfied: keras-preprocessing>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.1.0)
Collecting tensorboard<2.2.0,>=2.1.0
  Using cached https://files.pythonhosted.org/packages/40/23/53ffe290341cd0855d595b0a2e7485932f473798af173bbe3a584b99bb06/tensorboard-2.1.0-py3-none-any.whl
Requirement already satisfied: protobuf>=3.8.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (3.10.0)
Requirement already satisfied: astor>=0.6.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.8.1)
Requirement already satisfied: google-pasta>=0.1.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.1.8)
Requirement already satisfied: grpcio>=1.8.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.15.0)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.12.0)
Requirement already satisfied: absl-py>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.8.1)
Collecting tensorflow-estimator<2.2.0,>=2.1.0rc0
  Using cached https://files.pythonhosted.org/packages/18/90/b77c328a1304437ab1310b463e533fa7689f4bfc41549593056d812fab8e/tensorflow_estimator-2.1.0-py2.py3-none-any.whl
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (3.1.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.1.0)
Requirement already satisfied: gast==0.2.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.2.2)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->keras-tuner==1.0.0) (2019.11.28)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->keras-tuner==1.0.0) (1.24.3)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->keras-tuner==1.0.0) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->keras-tuner==1.0.0) (3.0.4)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->keras-tuner==1.0.0) (0.14.1)
Requirement already satisfied: h5py in /usr/local/lib/python3.6/dist-packages (from keras-applications>=1.0.8->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (2.8.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (3.1.1)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (42.0.2)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.16.0)
Requirement already satisfied: google-auth<2,>=1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.8.2)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.4.1)
Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (4.0)
Requirement already satisfied: cachetools<3.2,>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (3.1.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.2.7)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (1.3.0)
Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.6/dist-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow>=2.0.0-beta1->keras-tuner==1.0.0) (3.1.0)
Building wheels for collected packages: keras-tuner
  Building wheel for keras-tuner (setup.py) ... done
  Created wheel for keras-tuner: filename=keras_tuner-1.0.0-cp36-none-any.whl size=86620 sha256=d7be509f2520eb4687a77a29b622bbc853df18dcb978e82b9dd621bfa823aa0d
  Stored in directory: /root/.cache/pip/wheels/fa/4e/d0/ed331a3363786e8a74848aa0589674f502cbb3f0321cdba844
Successfully built keras-tuner
ERROR: tensorboard 2.1.0 has requirement grpcio>=1.24.3, but you'll have grpcio 1.15.0 which is incompatible.
Installing collected packages: tensorboard, tensorflow-estimator, tensorflow, keras-tuner
  Found existing installation: tensorboard 1.14.0
    Uninstalling tensorboard-1.14.0:
      Successfully uninstalled tensorboard-1.14.0
  Found existing installation: tensorflow-estimator 1.14.0
    Uninstalling tensorflow-estimator-1.14.0:
      Successfully uninstalled tensorflow-estimator-1.14.0
  Found existing installation: tensorflow 1.14.0
    Uninstalling tensorflow-1.14.0:
      Successfully uninstalled tensorflow-1.14.0
  Found existing installation: keras-tuner 1.0.0
    Uninstalling keras-tuner-1.0.0:
      Successfully uninstalled keras-tuner-1.0.0
Successfully installed keras-tuner-1.0.0 tensorboard-2.1.0 tensorflow-2.1.0rc1 tensorflow-estimator-2.1.0
In [0]:
from kerastuner.tuners import RandomSearch

Creating validation data set

In [0]:
X_sequential_train, X_sequential_val, y_sequential_train, y_sequential_val = train_test_split(X_train_lasso_scaled, y_train, test_size=0.2, random_state=10)
In [0]:
def build_model(hp):
    model = keras.Sequential()

    model.add(layers.Dense(10, input_dim=10, activation='relu'))
    model.add(layers.Dense(50, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate',
                      values=[1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    
    return model

tuner = RandomSearch(
  build_model,
  objective='val_acc',
  max_trials=5,
  executions_per_trial=3,
  directory='keras_tuner',
  project_name='sequential')

tuner.search(X_train_lasso_scaled.values, y_train.values,
             epochs=5,
             validation_data=(X_sequential_val.values, y_sequential_val.values))

tuner.results_summary()
INFO:tensorflow:Reloading Oracle from existing project keras_tuner/sequential/oracle.json
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 95us/sample - loss: 0.6040 - acc: 0.6746 - val_loss: 0.5639 - val_acc: 0.7097
Epoch 2/5
8367/8367 [==============================] - 1s 71us/sample - loss: 0.5649 - acc: 0.7109 - val_loss: 0.5526 - val_acc: 0.7240
Epoch 3/5
8367/8367 [==============================] - 1s 78us/sample - loss: 0.5602 - acc: 0.7136 - val_loss: 0.5482 - val_acc: 0.7276
Epoch 4/5
8367/8367 [==============================] - 1s 77us/sample - loss: 0.5577 - acc: 0.7182 - val_loss: 0.5476 - val_acc: 0.7252
Epoch 5/5
8367/8367 [==============================] - 1s 76us/sample - loss: 0.5558 - acc: 0.7163 - val_loss: 0.5467 - val_acc: 0.7264
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 83us/sample - loss: 0.5861 - acc: 0.6924 - val_loss: 0.5581 - val_acc: 0.7204
Epoch 2/5
8367/8367 [==============================] - 1s 77us/sample - loss: 0.5608 - acc: 0.7154 - val_loss: 0.5513 - val_acc: 0.7276
Epoch 3/5
8367/8367 [==============================] - 1s 73us/sample - loss: 0.5579 - acc: 0.7160 - val_loss: 0.5545 - val_acc: 0.7198
Epoch 4/5
8367/8367 [==============================] - 1s 78us/sample - loss: 0.5559 - acc: 0.7181 - val_loss: 0.5469 - val_acc: 0.7360
Epoch 5/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.5545 - acc: 0.7202 - val_loss: 0.5445 - val_acc: 0.7342
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 84us/sample - loss: 0.5835 - acc: 0.6939 - val_loss: 0.5575 - val_acc: 0.7168
Epoch 2/5
8367/8367 [==============================] - 1s 78us/sample - loss: 0.5619 - acc: 0.7101 - val_loss: 0.5512 - val_acc: 0.7240
Epoch 3/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.5592 - acc: 0.7133 - val_loss: 0.5481 - val_acc: 0.7324
Epoch 4/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.5575 - acc: 0.7148 - val_loss: 0.5473 - val_acc: 0.7318
Epoch 5/5
8367/8367 [==============================] - 1s 74us/sample - loss: 0.5562 - acc: 0.7154 - val_loss: 0.5466 - val_acc: 0.7318

Trial complete

Trial summary

|-Trial ID: abf8f311003e5976a71ce1d74e04e010
|-Score: 0.7319793105125427
|-Best step: 0

Hyperparameters:

|-learning_rate: 0.001
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 91us/sample - loss: 0.5755 - acc: 0.7097 - val_loss: 0.5570 - val_acc: 0.7234
Epoch 2/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.5628 - acc: 0.7190 - val_loss: 0.5514 - val_acc: 0.7336
Epoch 3/5
8367/8367 [==============================] - 1s 76us/sample - loss: 0.5592 - acc: 0.7203 - val_loss: 0.5443 - val_acc: 0.7294
Epoch 4/5
8367/8367 [==============================] - 1s 73us/sample - loss: 0.5560 - acc: 0.7191 - val_loss: 0.5503 - val_acc: 0.7384
Epoch 5/5
8367/8367 [==============================] - 1s 73us/sample - loss: 0.5576 - acc: 0.7227 - val_loss: 0.5449 - val_acc: 0.7330
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 80us/sample - loss: 0.5703 - acc: 0.7097 - val_loss: 0.5585 - val_acc: 0.7145
Epoch 2/5
8367/8367 [==============================] - 1s 76us/sample - loss: 0.5604 - acc: 0.7164 - val_loss: 0.5474 - val_acc: 0.7252
Epoch 3/5
8367/8367 [==============================] - 1s 70us/sample - loss: 0.5581 - acc: 0.7194 - val_loss: 0.5461 - val_acc: 0.7336
Epoch 4/5
8367/8367 [==============================] - 1s 71us/sample - loss: 0.5605 - acc: 0.7153 - val_loss: 0.5438 - val_acc: 0.7366
Epoch 5/5
8367/8367 [==============================] - 1s 71us/sample - loss: 0.5595 - acc: 0.7196 - val_loss: 0.5465 - val_acc: 0.7354
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 84us/sample - loss: 0.5730 - acc: 0.7023 - val_loss: 0.5648 - val_acc: 0.7151
Epoch 2/5
8367/8367 [==============================] - 1s 77us/sample - loss: 0.5616 - acc: 0.7153 - val_loss: 0.5481 - val_acc: 0.7276
Epoch 3/5
8367/8367 [==============================] - 1s 73us/sample - loss: 0.5589 - acc: 0.7191 - val_loss: 0.5444 - val_acc: 0.7318
Epoch 4/5
8367/8367 [==============================] - 1s 77us/sample - loss: 0.5576 - acc: 0.7207 - val_loss: 0.5435 - val_acc: 0.7288
Epoch 5/5
8367/8367 [==============================] - 1s 74us/sample - loss: 0.5570 - acc: 0.7225 - val_loss: 0.5381 - val_acc: 0.7366

Trial complete

Trial summary

|-Trial ID: 05ca2708b01d8299d97f7bbab7b2d5c3
|-Score: 0.7371565699577332
|-Best step: 0

Hyperparameters:

|-learning_rate: 0.01
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 90us/sample - loss: 0.6767 - acc: 0.5964 - val_loss: 0.6448 - val_acc: 0.6744
Epoch 2/5
8367/8367 [==============================] - 1s 75us/sample - loss: 0.6283 - acc: 0.6808 - val_loss: 0.6105 - val_acc: 0.6989
Epoch 3/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.6024 - acc: 0.6903 - val_loss: 0.5894 - val_acc: 0.7013
Epoch 4/5
8367/8367 [==============================] - 1s 71us/sample - loss: 0.5867 - acc: 0.6988 - val_loss: 0.5777 - val_acc: 0.7073
Epoch 5/5
8367/8367 [==============================] - 1s 71us/sample - loss: 0.5779 - acc: 0.7013 - val_loss: 0.5712 - val_acc: 0.7121
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 87us/sample - loss: 0.6642 - acc: 0.6171 - val_loss: 0.6412 - val_acc: 0.6906
Epoch 2/5
8367/8367 [==============================] - 1s 71us/sample - loss: 0.6280 - acc: 0.6852 - val_loss: 0.6110 - val_acc: 0.7013
Epoch 3/5
8367/8367 [==============================] - 1s 79us/sample - loss: 0.6060 - acc: 0.6959 - val_loss: 0.5918 - val_acc: 0.7133
Epoch 4/5
8367/8367 [==============================] - 1s 80us/sample - loss: 0.5920 - acc: 0.7023 - val_loss: 0.5800 - val_acc: 0.7151
Epoch 5/5
8367/8367 [==============================] - 1s 76us/sample - loss: 0.5834 - acc: 0.7054 - val_loss: 0.5725 - val_acc: 0.7186
Train on 8367 samples, validate on 1674 samples
Epoch 1/5
8367/8367 [==============================] - 1s 88us/sample - loss: 0.6699 - acc: 0.6130 - val_loss: 0.6418 - val_acc: 0.6852
Epoch 2/5
8367/8367 [==============================] - 1s 71us/sample - loss: 0.6248 - acc: 0.6861 - val_loss: 0.6037 - val_acc: 0.7055
Epoch 3/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.5972 - acc: 0.6943 - val_loss: 0.5819 - val_acc: 0.7121
Epoch 4/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.5830 - acc: 0.6986 - val_loss: 0.5716 - val_acc: 0.7127
Epoch 5/5
8367/8367 [==============================] - 1s 72us/sample - loss: 0.5765 - acc: 0.7004 - val_loss: 0.5666 - val_acc: 0.7151

Trial complete

Trial summary

|-Trial ID: f328b53bdf01e1f76430647160045372
|-Score: 0.7152528762817383
|-Best step: 0

Hyperparameters:

|-learning_rate: 0.0001
INFO:tensorflow:Oracle triggered exit

Results summary

|-Results in keras_tuner/sequential
|-Showing 10 best trials
|-Objective(name='val_acc', direction='max')

Trial summary

|-Trial ID: 05ca2708b01d8299d97f7bbab7b2d5c3
|-Score: 0.7371565699577332
|-Best step: 0

Hyperparameters:

|-learning_rate: 0.01

Trial summary

|-Trial ID: abf8f311003e5976a71ce1d74e04e010
|-Score: 0.7319793105125427
|-Best step: 0

Hyperparameters:

|-learning_rate: 0.001

Trial summary

|-Trial ID: f328b53bdf01e1f76430647160045372
|-Score: 0.7152528762817383
|-Best step: 0

Hyperparameters:

|-learning_rate: 0.0001
In [0]:
multi_layer_classifier = tuner.get_best_models(num_models=1)[0]

multi_layer_classifier.fit(X_train_lasso_scaled.values, y_train.values)

print('Test accuracy:',accuracy_score(y_train.values, multi_layer_classifier.predict(X_train_lasso_scaled.values).round(0)).round(6))
y_pred = multi_layer_classifier.predict(X_test_lasso_scaled.values)
print('Test accuracy:',accuracy_score(y_test.values, y_pred.round(0)).round(6))
8367/8367 [==============================] - 1s 74us/sample - loss: 0.5573 - acc: 0.7213
Test accuracy: 0.726186
Test accuracy: 0.707935

Hyperparameter tuning of the mlp model requires deeper fundamental knowledge over deep neural network.
Layers and nodes are arbitrary, whereas other parameters take reference from other examples.
Models built are not tailored to the HELOC data set and thus higher accuracy is not expected.

SHAP

Due to its limited explainability, Deep Neural Network will be further explained both globally and locally, heavily depending on SHAP values.

In [0]:
deep_explainer = shap.DeepExplainer(multi_layer_classifier, X_train_lasso_scaled)
# calculate the shape value
shap_values = deep_explainer.shap_values(np.array(X_test_lasso_scaled))
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/shap/explainers/deep/deep_tf.py:124: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

Let's take a look at the global interpretation of fitted deep neural network.

  1. Variable Importance via SHAP
In [0]:
shap.summary_plot(shap_values[0], X_test_lasso_scaled)

Importances of the variables are listed from top to bottom (Consolidated version of risk markers explain the output the most).

The output seems intuitively plausible.

Consolidated version of risk markers, Average Month in File, Max Delq Records Last 12 Months, Number Satisfactory Trades affect the SHAP value negatively for having low values of it. Which makes sense since low values of them tend to attribute to high risk, thus pushing the output negatively.

Conversely, for Net Fraction Revolving Burden, utilization ratio attribute to output negatviely when it has high value, since the higher revolving burden and utilization ratio intuitively attribute to high risk.

  1. Partial Dependence Plot via SHAP
In [0]:
# partial dependecy plot for all the features selected by lasso

for i, feature in enumerate(X_train_lasso_scaled.columns):    
    shap.dependence_plot(feature, shap_values[0], X_test_lasso_scaled, feature_names=X_test_lasso_scaled.columns)

plt.show()

Most obvious dependece is found between Consolidated version of risk marker and utilization ratio. Low value of utilization ratio tends to have negative SHAP value while consolidated version or risk markers have low SHAP values. This can be explained by the fact that low utilization ratio means low risk, therefore leading to high credit score, positively affecting the riskflag (good risk flag).

Now, let's interpret the model on a global level. How the each value is affecting the prediction for each data instance? Below is how the 100th data instance is affected by each feature. Each contributions to prediction is calculated via SHAP.

In [0]:
shap.initjs()
shap.force_plot(deep_explainer.expected_value, shap_values[0][100,:], X_test_lasso_scaled.iloc[100,:], feature_names=X_test_lasso_scaled.columns)
Out[0]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

It can be easily found that positive version of risk markers positively affects the SHAP, negative fraction revolving burden tends to affect the SHAP positively, posivie percent trades never Delinquent positively affecting output SHAP.

Below is the integration of the force plot for every data instance.

In [0]:
shap.initjs()
shap.force_plot(deep_explainer.expected_value, shap_values[0], X_test_lasso_scaled, feature_names=X_test_lasso_scaled.columns)
Out[0]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Conclusion

Logistic GAM with B spline reported 72.01% training accuracy and 71.89% test accuracy.
Gradient boosting reported 73.78% training accuracy and 70.89% test accuracy.
MLP reported 72.62% training accuracy and 70.79 test accuracy.

All models had similar accuracy (~1% gap). Hyperparameter tuning did not improve accuracy by a significant portion. This leads us to believe to achieve higher accuracy, the effects of better feature selection should be more significant than hyperparameter tuning. Further improvement on feature selection is recommended.

Finally, logistic GAM with b-splines is recommended. It has the highest test accuracy and smallest training accuracy to test accuracy gap. In addition, logistic GAM is a whitebox model. Therefore, it is more interpretable than the other two blackbox models we've used.